# Multi-image understanding
Phi 3.5 Vision Instruct
MIT
Phi-3.5-vision is a lightweight open-source multimodal model that supports a 128K-token context length and is built around high-quality, reasoning-dense text and vision data.
Image-to-Text
Transformers Other

FriendliAI
370
0
Pixtral 12b
Pixtral is a multimodal model based on the Mistral architecture that can handle image and text inputs and generate text outputs.
Image-to-Text
Transformers

saujasv
2,168
0
Pixtral 12b
Pixtral-12B is a multimodal model compatible with the Transformers library. It accepts image and text inputs and generates text outputs, which makes it suitable for image understanding and description tasks.
Image-to-Text
Transformers

mgoin
1,943
1